Data Mining Week 6: Gradient Boosted Decision Trees

We will start with classification techniques using gradient boosted decision trees.

In [3]:
import os
os.getcwd()
Out[3]:
'/Users/matthewberezo'
In [4]:
# Read in csv file for Surgical Deepnet data that is stored in path:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
surg_df = pd.read_csv('/Users/matthewberezo/Documents/Surgicaldeepnet.csv')
In [5]:
pd.set_option("display.max_rows", None)
In [6]:
surg_df.head()
Out[6]:
bmi Age asa_status baseline_cancer baseline_charlson baseline_cvd baseline_dementia baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary ahrq_ccs ccsComplicationRate ccsMort30Rate complication_rsi dow gender hour month moonphase mort30 mortality_rsi race complication
0 19.31 59.2 1 1 0 0 0 0 0 0 0 0 19 0.183370 0.007424 -0.57 3 0 7.63 6 1 0 -0.43 1 0
1 18.73 59.1 0 0 0 0 0 0 0 0 0 0 1 0.312029 0.016673 0.21 0 0 12.93 0 1 0 -0.41 1 0
2 21.85 59.0 0 0 0 0 0 0 0 0 0 0 6 0.150706 0.001962 0.00 2 0 7.68 5 3 0 0.08 1 0
3 18.49 59.0 1 0 1 0 0 1 1 0 0 0 7 0.056166 0.000000 -0.65 2 1 7.58 4 3 0 -0.32 1 0
4 19.70 59.0 1 0 0 0 0 0 0 0 0 0 11 0.197305 0.002764 0.00 0 0 7.88 11 0 0 0.00 1 0
In [7]:
surg_df.shape
Out[7]:
(14635, 25)
In [8]:
surg_df.describe()
Out[8]:
bmi Age asa_status baseline_cancer baseline_charlson baseline_cvd baseline_dementia baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary ahrq_ccs ccsComplicationRate ccsMort30Rate complication_rsi dow gender hour month moonphase mort30 mortality_rsi race complication
count 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000 14635.000000
mean 31.295642 63.205268 0.632320 0.262316 0.977520 0.620294 0.004851 0.120875 0.189546 0.342740 0.082405 0.094090 7.428493 0.133570 0.004447 -0.699044 1.606970 0.548890 10.171613 5.915408 1.187086 0.003963 -0.836712 0.919440 0.252135
std 8.152709 18.088191 0.539952 0.439909 1.758355 0.485330 0.069485 0.325993 0.391955 0.474642 0.274990 0.291963 6.949455 0.088402 0.004579 1.339394 1.497738 0.497621 2.659881 3.239825 1.158357 0.062830 1.194111 0.364663 0.434253
min 2.150000 6.100000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.016118 0.000000 -4.720000 0.000000 0.000000 6.070000 0.000000 0.000000 0.000000 -3.820000 0.000000 0.000000
25% 26.510000 51.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.081977 0.001962 -1.970000 0.000000 0.000000 7.820000 3.000000 0.000000 0.000000 -2.250000 1.000000 0.000000
50% 28.980000 59.700000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 0.105720 0.002959 -0.580000 1.000000 1.000000 9.120000 7.000000 1.000000 0.000000 -0.640000 1.000000 0.000000
75% 35.295000 74.700000 1.000000 1.000000 2.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 13.000000 0.183370 0.007398 0.000000 3.000000 1.000000 12.050000 8.000000 2.000000 0.000000 0.000000 1.000000 1.000000
max 92.590000 90.000000 2.000000 1.000000 13.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 21.000000 0.466129 0.016673 12.560000 4.000000 1.000000 18.920000 11.000000 3.000000 1.000000 4.400000 2.000000 1.000000

Binary classification for 30-day mortality with XGBoost

As discussed, XGBoost needs little data preparation. Since all of our features are already numerically encoded, we can simply split the data into train, validation, and test sets and load it into XGBoost's dense DMatrix format.
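As a quick sanity check on the split sizes: two chained 80/20 splits leave roughly 64% of the rows for training, 16% for validation, and 20% for testing. A minimal sketch on toy data (variable names here are placeholders, not the notebook's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: two chained 80/20 splits leave 64% train, 16% validation, 20% test
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.zeros(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=1)
print(len(X_tr), len(X_va), len(X_te))  # 64 16 20
```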

In [9]:
# Again, we split the data into training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(surg_df.drop(columns = ['mort30']), 
                                                    surg_df['mort30'], 
                                                    test_size=0.2, 
                                                    random_state=1)

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size = 0.2, random_state = 1)
In [10]:
import xgboost as xgb
In [11]:
dtrain = xgb.DMatrix(data = x_train, label = y_train)
dval = xgb.DMatrix(data = x_val, label = y_val)
dtest = xgb.DMatrix(data = x_test, label = y_test)
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
In [12]:
param = {'max_depth':3,
         'eta': 0.35,
         'silent':1,
         'objective':'binary:logistic',
         'eval_metric': 'logloss'
         #,'gamma': ???,
         #,'lambda': ???,
         #,'alpha': ???,
         #,'min_child_weight': ???,
         #,'colsample_bytree' :??? 
         #,'colsample_bynode' : ???
         #,'scale_pos_weight' : ???
         ,'maximize' : 'FALSE' # note: maximize is an argument to xgb.train(), not a booster parameter
         ,'n_jobs' : -1
         #,'base_score' : ???
         #,'max_delta_step' : ???
        }
In [13]:
# specify validations set to watch performance
watchlist = [(dtrain, 'train'), (dval, 'eval')]
num_round = 25 #This is another hyperparameter of sorts
bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds = 10)
[0]	train-logloss:0.406275	eval-logloss:0.40819
Multiple eval metrics have been passed: 'eval-logloss' will be used for early stopping.

Will train until eval-logloss hasn't improved in 10 rounds.
[1]	train-logloss:0.262492	eval-logloss:0.265369
[2]	train-logloss:0.177126	eval-logloss:0.181074
[3]	train-logloss:0.122914	eval-logloss:0.127885
[4]	train-logloss:0.087228	eval-logloss:0.093064
[5]	train-logloss:0.063151	eval-logloss:0.06926
[6]	train-logloss:0.046706	eval-logloss:0.053263
[7]	train-logloss:0.035407	eval-logloss:0.042588
[8]	train-logloss:0.027553	eval-logloss:0.035271
[9]	train-logloss:0.022105	eval-logloss:0.030594
[10]	train-logloss:0.018235	eval-logloss:0.027416
[11]	train-logloss:0.015531	eval-logloss:0.02517
[12]	train-logloss:0.013599	eval-logloss:0.023437
[13]	train-logloss:0.012203	eval-logloss:0.022209
[14]	train-logloss:0.01118	eval-logloss:0.02156
[15]	train-logloss:0.010391	eval-logloss:0.021234
[16]	train-logloss:0.009852	eval-logloss:0.020934
[17]	train-logloss:0.009399	eval-logloss:0.020914
[18]	train-logloss:0.008523	eval-logloss:0.021102
[19]	train-logloss:0.008054	eval-logloss:0.021439
[20]	train-logloss:0.007578	eval-logloss:0.021544
[21]	train-logloss:0.007105	eval-logloss:0.021793
[22]	train-logloss:0.006766	eval-logloss:0.021731
[23]	train-logloss:0.006615	eval-logloss:0.021912
[24]	train-logloss:0.006388	eval-logloss:0.022037
In [14]:
# Note: plain assignment does not copy the DataFrame -- mort_train_w_preds and
# x_train are the same object, so the prediction column is added to x_train too
# (which is why these columns get dropped again before training the later models)
mort_train_w_preds = x_train
mort_train_w_preds['xgb_probs'] = bst.predict(dtrain)

mort_test_w_preds = x_test
mort_test_w_preds['xgb_probs'] = bst.predict(dtest)
In [15]:
from sklearn import metrics
y = y_test
scores = mort_test_w_preds['xgb_probs']
In [16]:
fpr, tpr, thresholds = metrics.roc_curve(y, scores)
metrics.auc(fpr, tpr)
Out[16]:
0.9161149526751997
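AUC summarizes how well the probabilities rank cases, but it does not by itself give class predictions; for that you have to pick a cutoff. A minimal sketch on toy labels and scores, assuming an arbitrary 0.5 threshold, using sklearn's confusion_matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy stand-ins for the test labels and XGBoost probabilities (hypothetical values)
y_true = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

# Threshold the probabilities at an (arbitrary) 0.5 cutoff to get hard labels
preds = (probs >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
print(tn, fp, fn, tp)  # 3 0 1 2
```

On a rare outcome like mort30, a much lower cutoff than 0.5 is usually needed to recover any positives.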
In [17]:
# We can visualize these ROC curves with matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

plt.plot(roc_curve(y_train, mort_train_w_preds['xgb_probs'])[0],roc_curve(y_train, mort_train_w_preds['xgb_probs'])[1], 
         color = 'blue', label='Train ROC Curve (area = %0.2f)' % roc_auc_score(y_train, mort_train_w_preds['xgb_probs']))

plt.plot(roc_curve(y_test, mort_test_w_preds['xgb_probs'])[0],roc_curve(y_test, mort_test_w_preds['xgb_probs'])[1], 
         color = 'red', label='Test ROC Curve (area = %0.2f)' % roc_auc_score(y_test, mort_test_w_preds['xgb_probs']))


plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

Training XGBoost with AUC as the evaluation metric:

In [18]:
# Set our parameters
param_auc = {'max_depth':3,
         'eta': 0.35,
         'silent':1,
         'objective':'binary:logistic',
         'eval_metric': 'auc'
         #,'gamma': ???,
         #,'lambda': ???,
         #,'alpha': ???,
         #,'min_child_weight': ???,
         #,'colsample_bytree' :??? 
         #,'colsample_bynode' : ???
         #,'scale_pos_weight' : ???
         ,'maximize' : 'TRUE'
         ,'n_jobs' : -1
         #,'base_score' : ???
         #,'max_delta_step' : ???
        }
In [19]:
# specify validations set to watch performance
watchlist = [(dtrain, 'train'), (dval, 'eval')]
num_round = 25 #This is another hyperparameter of sorts
bst_auc = xgb.train(param_auc, dtrain, num_round, watchlist, early_stopping_rounds = 10)
[0]	train-auc:0.649597	eval-auc:0.642213
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 10 rounds.
[1]	train-auc:0.698963	eval-auc:0.748481
[2]	train-auc:0.747826	eval-auc:0.818483
[3]	train-auc:0.811932	eval-auc:0.886184
[4]	train-auc:0.827315	eval-auc:0.885355
[5]	train-auc:0.872567	eval-auc:0.918308
[6]	train-auc:0.874938	eval-auc:0.918293
[7]	train-auc:0.875129	eval-auc:0.917909
[8]	train-auc:0.921897	eval-auc:0.947364
[9]	train-auc:0.938746	eval-auc:0.9458
[10]	train-auc:0.953444	eval-auc:0.944511
[11]	train-auc:0.953529	eval-auc:0.942593
[12]	train-auc:0.95389	eval-auc:0.942256
[13]	train-auc:0.967209	eval-auc:0.941259
[14]	train-auc:0.969111	eval-auc:0.940722
[15]	train-auc:0.986484	eval-auc:0.961478
[16]	train-auc:0.988232	eval-auc:0.952672
[17]	train-auc:0.990183	eval-auc:0.961018
[18]	train-auc:0.990529	eval-auc:0.960435
[19]	train-auc:0.992973	eval-auc:0.97168
[20]	train-auc:0.995055	eval-auc:0.969364
[21]	train-auc:0.994748	eval-auc:0.966295
[22]	train-auc:0.995591	eval-auc:0.972048
[23]	train-auc:0.996401	eval-auc:0.970698
[24]	train-auc:0.996847	eval-auc:0.967507
In [20]:
mort_train_w_preds['xgb_probs_auc'] = bst_auc.predict(dtrain)
mort_test_w_preds['xgb_probs_auc'] = bst_auc.predict(dtest)
In [21]:
# We can visualize these ROC curves with matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

plt.plot(roc_curve(y_train, mort_train_w_preds['xgb_probs_auc'])[0],roc_curve(y_train, mort_train_w_preds['xgb_probs_auc'])[1], 
         color = 'blue', label='Train ROC Curve (area = %0.2f)' % roc_auc_score(y_train, mort_train_w_preds['xgb_probs_auc']))

plt.plot(roc_curve(y_test, mort_test_w_preds['xgb_probs_auc'])[0],roc_curve(y_test, mort_test_w_preds['xgb_probs_auc'])[1], 
         color = 'red', label='Test ROC Curve (area = %0.2f)' % roc_auc_score(y_test, mort_test_w_preds['xgb_probs_auc']))


plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

Training XGBoost with AUCPR as the evaluation metric:

In [22]:
# Set our parameters
param_aucpr = {'max_depth':3,
         'eta': 0.01,
         'silent':1,
         'objective':'binary:logistic',
         'eval_metric': 'aucpr'
         #,'gamma': ???,
         #,'lambda': ???,
         #,'alpha': ???,
         #,'min_child_weight': ???,
         #,'colsample_bytree' :??? 
         #,'colsample_bynode' : ???
         #,'scale_pos_weight' : ???
         ,'maximize' : 'TRUE'
         ,'n_jobs' : -1
         #,'base_score' : ???
         #,'max_delta_step' : ???
        }
In [23]:
watchlist = [(dtrain, 'train'), (dval, 'eval')]
bst_aucpr = xgb.train(param_aucpr, dtrain, num_round, watchlist, early_stopping_rounds = 10)
[0]	train-aucpr:0.278878	eval-aucpr:0.207802
Multiple eval metrics have been passed: 'eval-aucpr' will be used for early stopping.

Will train until eval-aucpr hasn't improved in 10 rounds.
[1]	train-aucpr:0.317593	eval-aucpr:0.317066
[2]	train-aucpr:0.322798	eval-aucpr:0.322122
[3]	train-aucpr:0.317593	eval-aucpr:0.317066
[4]	train-aucpr:0.325144	eval-aucpr:0.3649
[5]	train-aucpr:0.325144	eval-aucpr:0.3649
[6]	train-aucpr:0.317593	eval-aucpr:0.317066
[7]	train-aucpr:0.325144	eval-aucpr:0.3649
[8]	train-aucpr:0.325144	eval-aucpr:0.3649
[9]	train-aucpr:0.325144	eval-aucpr:0.3649
[10]	train-aucpr:0.325144	eval-aucpr:0.3649
[11]	train-aucpr:0.325144	eval-aucpr:0.3649
[12]	train-aucpr:0.325144	eval-aucpr:0.3649
[13]	train-aucpr:0.325144	eval-aucpr:0.3649
[14]	train-aucpr:0.325144	eval-aucpr:0.3649
Stopping. Best iteration:
[4]	train-aucpr:0.325144	eval-aucpr:0.3649
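With eta at 0.01 the trees barely move the predictions away from the base score, which is likely why the aucpr values repeat and training stops early. For a rare outcome like mort30 (about 0.4% positive per the describe() output), the precision-recall view is worth inspecting directly; a minimal sketch on toy imbalanced data, using sklearn's average_precision_score as a step-wise area under the PR curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Toy imbalanced labels/scores standing in for mort30 and the model probabilities
y_imb = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
s_imb = np.array([0.10, 0.20, 0.70, 0.10, 0.05, 0.30, 0.20, 0.10, 0.80, 0.60])

precision, recall, _ = precision_recall_curve(y_imb, s_imb)
ap = average_precision_score(y_imb, s_imb)  # step-wise area under the PR curve
print(round(ap, 3))  # 0.833
```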

In [24]:
mort_train_w_preds['xgb_probs_aucpr'] = bst_aucpr.predict(dtrain)
mort_test_w_preds['xgb_probs_aucpr'] = bst_aucpr.predict(dtest)
In [25]:
# We can visualize these ROC curves with matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

plt.plot(roc_curve(y_train, mort_train_w_preds['xgb_probs_aucpr'])[0],roc_curve(y_train, mort_train_w_preds['xgb_probs_aucpr'])[1], 
         color = 'blue', label='Train ROC Curve (area = %0.2f)' % roc_auc_score(y_train, mort_train_w_preds['xgb_probs_aucpr']))

plt.plot(roc_curve(y_test, mort_test_w_preds['xgb_probs_aucpr'])[0],roc_curve(y_test, mort_test_w_preds['xgb_probs_aucpr'])[1], 
         color = 'red', label='Test ROC Curve (area = %0.2f)' % roc_auc_score(y_test, mort_test_w_preds['xgb_probs_aucpr']))


plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

LightGBM Examples for binary classification:

In [26]:
import lightgbm as lgb
lgb_mort_train = lgb.Dataset(x_train.drop(columns = ['xgb_probs', 'xgb_probs_auc', 'xgb_probs_aucpr'])
                                          , y_train)
lgb_val_train = lgb.Dataset(x_val, y_val)
In [27]:
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'max_depth' : 3,
    #'num_leaves' : ???
    'learning_rate': 0.1,
    #'num_threads' : -1,
    #'scale_pos_weight' : ???
    'early_stopping_round' : 10,
    # min_data_in_leaf = ???,
    # pos_bagging_fraction = ???,
    # neg_bagging_fraction = ???,
    # bagging_freq = ???,
    # max_delta_step = ???,
    #'top_rate' : ???
    #'other_rate' : ???
    #'lambda_l1' : ???
    #'lambda_l2' : ???
}
In [28]:
lgb_gbm = lgb.train(params = lgb_params, train_set = lgb_mort_train,
                num_boost_round = 100, valid_sets = [lgb_val_train, lgb_mort_train],
               valid_names = ['Evaluation', 'Train'])
[1]	Train's binary_logloss: 0.0238565	Evaluation's binary_logloss: 0.0306558
Training until validation scores don't improve for 10 rounds.
[2]	Train's binary_logloss: 0.0432346	Evaluation's binary_logloss: 0.0730987
[3]	Train's binary_logloss: 0.0417709	Evaluation's binary_logloss: 0.0704192
[4]	Train's binary_logloss: 0.0410337	Evaluation's binary_logloss: 0.0692148
[5]	Train's binary_logloss: 0.0399304	Evaluation's binary_logloss: 0.0683451
[6]	Train's binary_logloss: 0.0393017	Evaluation's binary_logloss: 0.0677408
[7]	Train's binary_logloss: 0.0389159	Evaluation's binary_logloss: 0.0673647
[8]	Train's binary_logloss: 0.0384593	Evaluation's binary_logloss: 0.0669506
[9]	Train's binary_logloss: 0.0375877	Evaluation's binary_logloss: 0.0668378
[10]	Train's binary_logloss: 0.0372138	Evaluation's binary_logloss: 0.066574
[11]	Train's binary_logloss: 0.036977	Evaluation's binary_logloss: 0.066135
Early stopping, best iteration is:
[1]	Train's binary_logloss: 0.0238565	Evaluation's binary_logloss: 0.0306558
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:123: UserWarning: Found `early_stopping_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
In [29]:
y_probs_train = lgb_gbm.predict(x_train.drop(columns = ['xgb_probs', 'xgb_probs_auc', 'xgb_probs_aucpr']))
# Drop the XGBoost prediction columns from x_test as well, so the features match the training set
y_probs_test = lgb_gbm.predict(x_test.drop(columns = ['xgb_probs', 'xgb_probs_auc', 'xgb_probs_aucpr']))
In [30]:
fpr, tpr, thresholds = metrics.roc_curve(y_train, y_probs_train)
metrics.auc(fpr, tpr)
Out[30]:
0.8675646243930305
In [31]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_probs_test)
metrics.auc(fpr, tpr)
Out[31]:
0.877163944877642
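As an aside, calling metrics.auc on the roc_curve output, as done here, is equivalent to calling roc_auc_score on the labels and scores directly; a toy check:

```python
import numpy as np
from sklearn import metrics

# Toy check: metrics.auc on the roc_curve output equals roc_auc_score directly
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_t, tpr_t, _ = metrics.roc_curve(y_toy, s_toy)
auc_from_curve = metrics.auc(fpr_t, tpr_t)
auc_direct = metrics.roc_auc_score(y_toy, s_toy)
print(auc_from_curve, auc_direct)  # 0.75 0.75
```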
In [32]:
#### LightGBM Example with different evaluation metric:
lgb_params_auc = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'max_depth' : 3,
    #'num_leaves' : ???
    'learning_rate': 0.1,
    #'num_threads' : -1,
    #'scale_pos_weight' : ???
    'early_stopping_round' : 10,
    # min_data_in_leaf = ???,
    # pos_bagging_fraction = ???,
    # neg_bagging_fraction = ???,
    # bagging_freq = ???,
    # max_delta_step = ???,
    #'top_rate' : ???
    #'other_rate' : ???
    #'lambda_l1' : ???
    #'lambda_l2' : ???
}
In [33]:
lgb_gbm_auc = lgb.train(params = lgb_params_auc, train_set = lgb_mort_train,
                num_boost_round = 100, valid_sets = [lgb_val_train, lgb_mort_train],
               valid_names = ['Evaluation', 'Train'])
[1]	Train's auc: 0.867565	Evaluation's auc: 0.914872
Training until validation scores don't improve for 10 rounds.
[2]	Train's auc: 0.61074	Evaluation's auc: 0.662417
[3]	Train's auc: 0.629786	Evaluation's auc: 0.66412
[4]	Train's auc: 0.62912	Evaluation's auc: 0.663875
[5]	Train's auc: 0.650857	Evaluation's auc: 0.663061
[6]	Train's auc: 0.653592	Evaluation's auc: 0.66231
[7]	Train's auc: 0.652098	Evaluation's auc: 0.659242
[8]	Train's auc: 0.653158	Evaluation's auc: 0.659564
[9]	Train's auc: 0.669036	Evaluation's auc: 0.656265
[10]	Train's auc: 0.683551	Evaluation's auc: 0.684033
[11]	Train's auc: 0.686347	Evaluation's auc: 0.689801
Early stopping, best iteration is:
[1]	Train's auc: 0.867565	Evaluation's auc: 0.914872

CatBoost examples for binary classification:

In [34]:
surg_df.head()
Out[34]:
bmi Age asa_status baseline_cancer baseline_charlson baseline_cvd baseline_dementia baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary ahrq_ccs ccsComplicationRate ccsMort30Rate complication_rsi dow gender hour month moonphase mort30 mortality_rsi race complication
0 19.31 59.2 1 1 0 0 0 0 0 0 0 0 19 0.183370 0.007424 -0.57 3 0 7.63 6 1 0 -0.43 1 0
1 18.73 59.1 0 0 0 0 0 0 0 0 0 0 1 0.312029 0.016673 0.21 0 0 12.93 0 1 0 -0.41 1 0
2 21.85 59.0 0 0 0 0 0 0 0 0 0 0 6 0.150706 0.001962 0.00 2 0 7.68 5 3 0 0.08 1 0
3 18.49 59.0 1 0 1 0 0 1 1 0 0 0 7 0.056166 0.000000 -0.65 2 1 7.58 4 3 0 -0.32 1 0
4 19.70 59.0 1 0 0 0 0 0 0 0 0 0 11 0.197305 0.002764 0.00 0 0 7.88 11 0 0 0.00 1 0

The easiest way to tell CatBoost which features are categorical is to first convert those columns to strings:

In [35]:
cat_cols = ['asa_status', 'baseline_cancer', 'baseline_charlson', 'baseline_cvd', 'baseline_dementia',
            'baseline_diabetes', 'baseline_digestive', 'baseline_osteoart', 'baseline_psych', 'baseline_pulmonary',
            'dow', 'gender', 'month', 'moonphase', 'race', 'complication']

x_train_cat = x_train.drop(columns = ['xgb_probs', 'xgb_probs_auc', 'xgb_probs_aucpr'])
x_val_cat = x_val.copy() # copy so the string conversion does not modify x_val in place
x_test_cat = x_test.drop(columns = ['xgb_probs', 'xgb_probs_auc', 'xgb_probs_aucpr'])

x_train_cat[cat_cols] = x_train_cat[cat_cols].astype(str)
x_val_cat[cat_cols] = x_val_cat[cat_cols].astype(str)
x_test_cat[cat_cols] = x_test_cat[cat_cols].astype(str)
In [36]:
x_train_cat.head()
Out[36]:
bmi Age asa_status baseline_cancer baseline_charlson baseline_cvd baseline_dementia baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary ahrq_ccs ccsComplicationRate ccsMort30Rate complication_rsi dow gender hour month moonphase mortality_rsi race complication
6976 28.98 90.0 1 0 0 1 0 0 0 1 0 0 0 0.081977 0.002959 -1.97 0 1 9.12 8 0 -2.25 1 0
14103 53.98 43.5 1 1 4 0 0 0 0 0 0 0 6 0.150706 0.001962 -0.20 1 0 18.43 9 1 0.72 0 1
11067 27.23 83.9 1 0 1 1 0 0 1 0 0 0 19 0.183370 0.007424 -0.26 3 1 7.78 9 2 -0.24 1 1
9541 43.75 55.7 1 0 0 1 0 0 0 1 0 0 0 0.081977 0.002959 -2.53 3 0 13.92 0 3 -3.07 1 0
11957 33.72 69.5 1 1 2 1 0 0 0 0 0 0 11 0.197305 0.002764 3.98 2 1 7.58 9 3 3.03 1 1
In [37]:
x_train_cat.nunique()
Out[37]:
bmi                    2686
Age                     644
asa_status                3
baseline_cancer           2
baseline_charlson        13
baseline_cvd              2
baseline_dementia         2
baseline_diabetes         2
baseline_digestive        2
baseline_osteoart         2
baseline_psych            2
baseline_pulmonary        2
ahrq_ccs                 22
ccsComplicationRate      22
ccsMort30Rate            20
complication_rsi        715
dow                       5
gender                    2
hour                    704
month                    12
moonphase                 4
mortality_rsi           589
race                      3
complication              2
dtype: int64
In [38]:
x_train.dtypes
Out[38]:
bmi                    float64
Age                    float64
asa_status               int64
baseline_cancer          int64
baseline_charlson        int64
baseline_cvd             int64
baseline_dementia        int64
baseline_diabetes        int64
baseline_digestive       int64
baseline_osteoart        int64
baseline_psych           int64
baseline_pulmonary       int64
ahrq_ccs                 int64
ccsComplicationRate    float64
ccsMort30Rate          float64
complication_rsi       float64
dow                      int64
gender                   int64
hour                   float64
month                    int64
moonphase                int64
mortality_rsi          float64
race                     int64
complication             int64
xgb_probs              float32
xgb_probs_auc          float32
xgb_probs_aucpr        float32
dtype: object
In [39]:
import numpy as np
# Create index for categorical variables
# (compare against the builtin float; np.float is deprecated in recent NumPy)

predictors = x_train_cat
categorical_var = np.where(predictors.dtypes != float)[0]
print('\nCategorical Variables indices : ', categorical_var)
Categorical Variables indices :  [ 2  3  4  5  6  7  8  9 10 11 12 16 17 19 20 22 23]
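The dtype comparison flags every column that is not float64: the string columns (now object dtype) and also the int64 column ahrq_ccs, which is why index 12 appears above even though it was never cast to str. A self-contained sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking x_train_cat's mix of dtypes (hypothetical values):
# float64 stays numeric, while object (string) and int64 columns are flagged
df = pd.DataFrame({'bmi': [19.3, 18.7],        # float64
                   'asa_status': ['1', '0'],   # object (converted to str earlier)
                   'ahrq_ccs': [19, 1]})       # int64 -- also != float, so flagged

categorical_idx = np.where(df.dtypes != float)[0]  # builtin float; np.float is deprecated
print(categorical_idx)  # [1 2]
```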
In [40]:
from catboost import CatBoostClassifier, Pool, cv
In [41]:
cat_boost_model = CatBoostClassifier(
    loss_function = 'Logloss',
    random_seed=42,
    iterations = 10,
    learning_rate = 0.03,
    early_stopping_rounds = 10,
    #l2_leaf_reg = ???
    depth = 3
    
)
In [42]:
cat_boost_model.fit(
    x_train_cat, y_train
    ,cat_features=categorical_var,
    eval_set=(x_val_cat, y_val)
    , plot = True
)
0:	learn: 0.6027667	test: 0.6027611	best: 0.6027611 (0)	total: 117ms	remaining: 1.06s
1:	learn: 0.5273552	test: 0.5278004	best: 0.5278004 (1)	total: 148ms	remaining: 591ms
2:	learn: 0.4501905	test: 0.4510574	best: 0.4510574 (2)	total: 177ms	remaining: 413ms
3:	learn: 0.3876575	test: 0.3886278	best: 0.3886278 (3)	total: 206ms	remaining: 310ms
4:	learn: 0.3370462	test: 0.3382135	best: 0.3382135 (4)	total: 241ms	remaining: 241ms
5:	learn: 0.2972373	test: 0.2987500	best: 0.2987500 (5)	total: 276ms	remaining: 184ms
6:	learn: 0.2634266	test: 0.2654371	best: 0.2654371 (6)	total: 310ms	remaining: 133ms
7:	learn: 0.2311741	test: 0.2334020	best: 0.2334020 (7)	total: 344ms	remaining: 86ms
8:	learn: 0.2051432	test: 0.2077160	best: 0.2077160 (8)	total: 375ms	remaining: 41.7ms
9:	learn: 0.1788389	test: 0.1816667	best: 0.1816667 (9)	total: 420ms	remaining: 0us

bestTest = 0.181666657
bestIteration = 9

Out[42]:
<catboost.core.CatBoostClassifier at 0x1a1b7d79e8>
In [43]:
catboost_probs_train = cat_boost_model.predict_proba(x_train_cat)
catboost_probs = cat_boost_model.predict_proba(x_test_cat)
In [44]:
catboost_probs_df_train = pd.DataFrame(catboost_probs_train)
catboost_probs_df_train = catboost_probs_df_train.add_prefix('cat')

catboost_probs_df = pd.DataFrame(catboost_probs)
catboost_probs_df = catboost_probs_df.add_prefix('cat')
fprc, tprc, thresholds = metrics.roc_curve(y_train, catboost_probs_df_train['cat1'])
metrics.auc(fprc, tprc)
Out[44]:
0.9292345044273065
In [45]:
fprc, tprc, thresholds = metrics.roc_curve(y_test, catboost_probs_df['cat1'])
metrics.auc(fprc, tprc)
Out[45]:
0.9418003040557108

CatBoost with CrossEntropy as the objective:

In [46]:
cat_boost_model_ce = CatBoostClassifier(
    loss_function = 'CrossEntropy',
    random_seed=42,
    iterations = 10,
    learning_rate = 0.03,
    early_stopping_rounds = 10,
    #l2_leaf_reg = ???
    depth = 3
    
)
In [47]:
cat_boost_model_ce.fit(
    x_train_cat, y_train
    ,cat_features=categorical_var,
    eval_set=(x_val_cat, y_val)
    , plot = True
)
0:	learn: 0.6027667	test: 0.6027611	best: 0.6027611 (0)	total: 34.7ms	remaining: 312ms
1:	learn: 0.5273552	test: 0.5278004	best: 0.5278004 (1)	total: 67.7ms	remaining: 271ms
2:	learn: 0.4501905	test: 0.4510574	best: 0.4510574 (2)	total: 101ms	remaining: 235ms
3:	learn: 0.3876575	test: 0.3886278	best: 0.3886278 (3)	total: 132ms	remaining: 199ms
4:	learn: 0.3370462	test: 0.3382135	best: 0.3382135 (4)	total: 177ms	remaining: 177ms
5:	learn: 0.2972373	test: 0.2987500	best: 0.2987500 (5)	total: 209ms	remaining: 140ms
6:	learn: 0.2634266	test: 0.2654371	best: 0.2654371 (6)	total: 246ms	remaining: 105ms
7:	learn: 0.2311741	test: 0.2334020	best: 0.2334020 (7)	total: 279ms	remaining: 69.7ms
8:	learn: 0.2051432	test: 0.2077160	best: 0.2077160 (8)	total: 317ms	remaining: 35.3ms
9:	learn: 0.1788389	test: 0.1816667	best: 0.1816667 (9)	total: 351ms	remaining: 0us

bestTest = 0.181666657
bestIteration = 9

Out[47]:
<catboost.core.CatBoostClassifier at 0x1a1b7d7a20>
In [48]:
catboost_probs_ce_train = cat_boost_model_ce.predict_proba(x_train_cat)
catboost_probs_ce = cat_boost_model_ce.predict_proba(x_test_cat)
In [49]:
catboost_probs_df_train_ce = pd.DataFrame(catboost_probs_ce_train)
catboost_probs_df_train_ce = catboost_probs_df_train_ce.add_prefix('cat')

catboost_probs_df_ce = pd.DataFrame(catboost_probs_ce)
catboost_probs_df_ce = catboost_probs_df_ce.add_prefix('cat')
fprc, tprc, thresholds = metrics.roc_curve(y_train, catboost_probs_df_train_ce['cat1'])
metrics.auc(fprc, tprc)
Out[49]:
0.9292345044273065
In [50]:
fprc, tprc, thresholds = metrics.roc_curve(y_test, catboost_probs_df_ce['cat1'])
metrics.auc(fprc, tprc)
Out[50]:
0.9418003040557108
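Note that the CrossEntropy run reproduces the Logloss learning curve digit for digit. That is expected: for hard 0/1 labels, binary cross entropy and log loss are the same quantity; CrossEntropy only behaves differently when the labels are soft probabilities in [0, 1]. A quick numeric check on toy values:

```python
import numpy as np

# Binary cross entropy: -[y*log(p) + (1-y)*log(1-p)]
y = np.array([0., 1., 1., 0.])
p = np.array([0.2, 0.7, 0.9, 0.4])

ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
ll = -np.mean(np.log(np.where(y == 1, p, 1 - p)))  # Logloss: -log of the prob of the true class
print(np.isclose(ce, ll))  # True
```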

Part II: Regression with Gradient Boosted Decision Trees

We will use the Kaggle housing price dataset to build gradient boosted decision trees for regression.

In [51]:
house_price = pd.read_csv('/Users/matthewberezo/Documents/kaggle_housing.csv')
In [52]:
house_price.shape
Out[52]:
(1460, 81)
In [53]:
house_price.dtypes
Out[53]:
Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
BsmtExposure      object
BsmtFinType1      object
BsmtFinSF1         int64
BsmtFinType2      object
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
Heating           object
HeatingQC         object
CentralAir        object
Electrical        object
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64
GarageFinish      object
GarageCars         int64
GarageArea         int64
GarageQual        object
GarageCond        object
PavedDrive        object
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
PoolQC            object
Fence             object
MiscFeature       object
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
dtype: object
In [54]:
target = house_price['SalePrice']
In [55]:
house_price.head()
Out[55]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
In [56]:
house_price['Alley'].unique()
Out[56]:
array([nan, 'Grvl', 'Pave'], dtype=object)
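Note the `nan` entry among Alley's unique values. Because each categorical column is cast to `str` before encoding in the next cell, missing values become the literal string `'nan'` and receive their own integer code. A minimal sketch with toy data:

```python
# Toy sketch: casting a column with NaNs to str makes each NaN the literal
# string 'nan', which LabelEncoder then treats as its own category.
import numpy as np
import pandas as pd
from sklearn import preprocessing

alley = pd.Series(['Grvl', np.nan, 'Pave', np.nan]).astype(str)
le = preprocessing.LabelEncoder()
codes = le.fit_transform(alley)
print(list(le.classes_))  # ['Grvl', 'Pave', 'nan']
print(list(codes))        # [0, 2, 1, 2]
```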
In [57]:
from sklearn import preprocessing

# MSSubClass is stored as an integer but is really a categorical code,
# so it is cast to str before encoding; its encoded column keeps the
# MSSUBCLASS_2 name that is referenced below.
house_price['MSSubClass'] = house_price['MSSubClass'].astype(str)
le_mssubclass = preprocessing.LabelEncoder()
house_price['MSSUBCLASS_2'] = le_mssubclass.fit_transform(house_price['MSSubClass'])

# Label-encode the remaining categorical columns in a loop. Casting to str
# first turns NaN into the string 'nan', so missing values are encoded as
# their own category. The fitted encoders are kept in a dict in case the
# integer codes need to be inverted later.
cat_cols = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
            'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood',
            'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
            'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
            'MasVnrType', 'ExterQual', 'Foundation', 'BsmtQual',
            'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
            'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
            'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish',
            'GarageQual', 'GarageCond', 'PavedDrive']

encoders = {}
for col in cat_cols:
    house_price[col] = house_price[col].astype(str)
    encoders[col] = preprocessing.LabelEncoder()
    house_price[col + '_2'] = encoders[col].fit_transform(house_price[col])
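Label encoding imposes an arbitrary integer order on unordered categories, which tree splits can work around but which can mislead linear models such as the `gblinear` booster used later. One-hot encoding via `pd.get_dummies` is a common alternative; a toy sketch (not the notebook's data):

```python
# Toy sketch (not the notebook's data): one-hot encoding with pd.get_dummies,
# which avoids imposing an ordinal scale on unordered categories.
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
dummies = pd.get_dummies(toy, columns=['Street'], prefix='Street')
print(list(dummies.columns))  # ['Street_Grvl', 'Street_Pave']
```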
In [58]:
house_price.head()
Out[58]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice MSSUBCLASS_2 MSZoning_2 Street_2 Alley_2 LotShape_2 LandContour_2 Utilities_2 LotConfig_2 LandSlope_2 Neighborhood_2 Condition1_2 Condition2_2 BldgType_2 HouseStyle_2 RoofStyle_2 RoofMatl_2 Exterior1st_2 Exterior2nd_2 MasVnrType_2 ExterQual_2 Foundation_2 BsmtQual_2 BsmtCond_2 BsmtExposure_2 BsmtFinType1_2 BsmtFinType2_2 Heating_2 HeatingQC_2 CentralAir_2 Electrical_2 KitchenQual_2 FireplaceQu_2 GarageType_2 GarageFinish_2 GarageQual_2 GarageCond_2 PavedDrive_2
0 1 60 RL 65.0 8450 Pave nan Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 nan Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500 9 3 1 2 3 3 0 4 0 5 2 2 0 5 1 1 12 13 1 2 2 2 3 3 2 5 1 0 1 4 2 5 1 1 4 4 2
1 2 20 RL 80.0 9600 Pave nan Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500 4 3 1 2 3 3 0 2 0 24 1 2 0 2 1 1 8 8 2 3 1 2 3 1 0 5 1 0 1 4 3 4 1 1 4 4 2
2 3 60 RL 68.0 11250 Pave nan IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500 9 3 1 2 0 3 0 4 0 5 2 2 0 5 1 1 12 13 1 2 2 2 3 2 2 5 1 0 1 4 2 4 1 1 4 4 2
3 4 70 RL 60.0 9550 Pave nan IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000 10 3 1 2 0 3 0 0 0 6 2 2 0 5 1 1 13 15 2 3 0 3 1 3 0 5 1 2 1 4 2 2 5 2 4 4 2
4 5 60 RL 84.0 14260 Pave nan IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000 9 3 1 2 0 3 0 2 0 15 2 2 0 5 1 1 12 13 1 2 2 2 3 0 2 5 1 0 1 4 2 4 1 1 4 4 2
In [59]:
i_vars = house_price[['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
                      'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
                      'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
                      'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
                      'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
                      'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
                      'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch',
                      'PoolArea', 'MiscVal', 'MoSold', 'YrSold',
                      'MSSUBCLASS_2', 'MSZoning_2', 'Street_2', 'Alley_2',
                      'LotShape_2', 'LandContour_2', 'Utilities_2', 'LotConfig_2',
                      'LandSlope_2', 'Neighborhood_2', 'Condition1_2', 'Condition2_2',
                      'BldgType_2', 'HouseStyle_2', 'RoofStyle_2', 'RoofMatl_2',
                      'Exterior1st_2', 'Exterior2nd_2', 'MasVnrType_2', 'ExterQual_2',
                      'Foundation_2', 'BsmtQual_2', 'BsmtCond_2', 'BsmtExposure_2',
                      'BsmtFinType1_2', 'BsmtFinType2_2', 'Heating_2', 'HeatingQC_2',
                      'CentralAir_2', 'Electrical_2', 'KitchenQual_2', 'FireplaceQu_2',
                      'GarageType_2', 'GarageFinish_2', 'GarageQual_2', 'GarageCond_2',
                      'PavedDrive_2']]

target = house_price['SalePrice']
In [60]:
# Split the new dataset into training, validation, and test sets

x_trainr, x_testr, y_trainr, y_testr = train_test_split(i_vars, 
                                                    target, 
                                                    test_size=0.2, 
                                                    random_state=1)

x_trainr, x_valr, y_trainr, y_valr = train_test_split(x_trainr, y_trainr, test_size = 0.2, random_state = 1)
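Chaining two 80/20 splits this way leaves 64% of the rows for training, 16% for validation, and 20% for testing, which is consistent with the training shape printed below. A toy sketch of the arithmetic, using only scikit-learn and synthetic data:

```python
# Toy illustration (synthetic data, not the house prices): the nested split
# used above yields a 64% / 16% / 20% train / validation / test partition.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=1)
print(len(X_tr), len(X_va), len(X_te))  # 640 160 200
```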
In [61]:
x_trainr.shape
Out[61]:
(934, 69)
In [62]:
dtrainr = xgb.DMatrix(data = x_trainr, label = y_trainr)
dvalr = xgb.DMatrix(data = x_valr, label = y_valr)
dtestr = xgb.DMatrix(data = x_testr, label = y_testr)
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
In [63]:
param_r = {'booster' : 'gblinear'
           #,'lambda' : ??? # L2 regularization on the linear weights
           #,'alpha' : ???  # L1 regularization on the linear weights
           ,'feature_selector' : 'cyclic' # also 'shuffle', 'random', 'greedy', 'thrifty'
           #,'top_k' : ??? # only available for the 'greedy' and 'thrifty' selectors
           ,'objective' : 'reg:squarederror' # also 'reg:squaredlogerror'
           ,'eval_metric' : 'rmse' # also 'rmsle'
           ,'maximize' : False # lower RMSE is better
        }
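For illustration only, here is how the commented-out regularization knobs could be filled in. The numeric values below are assumptions for the sketch, not tuned settings:

```python
# Purely illustrative: filling in the gblinear regularization knobs left
# as ??? above. The numeric values are assumptions, not tuned settings.
param_r_regularized = {
    'booster': 'gblinear',
    'feature_selector': 'cyclic',
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'lambda': 1.0,  # L2 penalty on the linear weights (assumed value)
    'alpha': 0.1,   # L1 penalty on the linear weights (assumed value)
}
```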
In [64]:
watchlist = [(dtrainr, 'train'), (dvalr, 'eval')]
num_round = 25 # the number of boosting rounds is itself a hyperparameter
xgb_r = xgb.train(param_r, dtrainr, num_round, watchlist, early_stopping_rounds = 10)
[0]	train-rmse:52410.2	eval-rmse:44897.1
Multiple eval metrics have been passed: 'eval-rmse' will be used for early stopping.

Will train until eval-rmse hasn't improved in 10 rounds.
[1]	train-rmse:47075.1	eval-rmse:38914.6
[2]	train-rmse:44461.4	eval-rmse:36179.1
[3]	train-rmse:42617.7	eval-rmse:34367.5
[4]	train-rmse:41137.4	eval-rmse:33023.8
[5]	train-rmse:39946.3	eval-rmse:31942
[6]	train-rmse:38977.2	eval-rmse:31110.5
[7]	train-rmse:38109	eval-rmse:30345.6
[8]	train-rmse:37438.6	eval-rmse:29761
[9]	train-rmse:36861.3	eval-rmse:29306
[10]	train-rmse:36393.1	eval-rmse:28859.8
[11]	train-rmse:35999.2	eval-rmse:28524.8
[12]	train-rmse:35639.2	eval-rmse:28172.2
[13]	train-rmse:35344.6	eval-rmse:27921.6
[14]	train-rmse:35074.2	eval-rmse:27693.7
[15]	train-rmse:34854.9	eval-rmse:27506.9
[16]	train-rmse:34666.1	eval-rmse:27324.9
[17]	train-rmse:34488.2	eval-rmse:27164.6
[18]	train-rmse:34340.1	eval-rmse:27006.4
[19]	train-rmse:34180.2	eval-rmse:26872
[20]	train-rmse:34041.7	eval-rmse:26764.5
[21]	train-rmse:33918.8	eval-rmse:26638.6
[22]	train-rmse:33826.3	eval-rmse:26596.8
[23]	train-rmse:33726.9	eval-rmse:26508.3
[24]	train-rmse:33641.1	eval-rmse:26466.7
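The log above illustrates the early-stopping rule: training halts once the eval metric fails to improve for 10 consecutive rounds (here eval-rmse kept improving, so all 25 rounds ran). The rule itself can be sketched in plain Python:

```python
# Plain-Python sketch of the early-stopping rule: stop once the validation
# metric has not improved for `patience` consecutive rounds.
def early_stop_round(metric_per_round, patience=10):
    """Return the round at which training would stop, or None if it never does."""
    best = float('inf')
    best_round = 0
    for i, m in enumerate(metric_per_round):
        if m < best:
            best, best_round = m, i
        elif i - best_round >= patience:
            return i
    return None

# A run that keeps improving never triggers the stop:
print(early_stop_round([5, 4, 3, 2, 1]))       # None
# A run that plateaus stops `patience` rounds after its best score:
print(early_stop_round([5, 4, 3] + [3] * 10))  # 12
```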
In [65]:
# Note: these assignments create aliases of x_trainr / x_testr, not copies.
# Adding columns below therefore mutates the originals and triggers the
# SettingWithCopyWarning; .copy() would avoid both.
house_price_train_preds = x_trainr
house_price_train_preds['price_pred'] = xgb_r.predict(dtrainr)

house_price_test_preds = x_testr
house_price_test_preds['price_pred'] = xgb_r.predict(dtestr)
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
In [66]:
house_price_train_preds['PRICE'] = y_trainr
house_price_test_preds['PRICE'] = y_testr
house_price_test_preds['ERROR'] = house_price_test_preds['PRICE'] - house_price_test_preds['price_pred']

from sklearn.metrics import r2_score

print("Train R2 =", r2_score(house_price_train_preds['PRICE'], house_price_train_preds['price_pred'])
    ,"Test R2 =", r2_score(house_price_test_preds['PRICE'], house_price_test_preds['price_pred']))
Train R2 = 0.8184215794887328 Test R2 = 0.8275516513661225
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
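`r2_score` computes the coefficient of determination, R² = 1 − SS_res / SS_tot; a small self-contained check of that formula with toy numbers:

```python
# Self-contained check that r2_score equals 1 - SS_res / SS_tot (toy numbers):
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.0, 4.5, 4.0])
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_pred))  # both 0.7
```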
In [67]:
import plotly.express as px
fig = px.scatter(house_price_test_preds, x="price_pred", y="PRICE",
                 hover_data=['ERROR', 'PRICE'])
fig.show()

LightGBM example for regression:

In [68]:
lgb_params_r = {
    'boosting_type': 'gbdt', # also 'goss' and 'dart'
    'objective': 'regression',
    'metric': 'l1', # 'l1' is MAE (aliases: 'mean_absolute_error', 'mae'); 'rmse' is also available
    'max_depth' : 3,
    #'num_leaves' : ???
    'learning_rate': 0.1,
    #'num_threads' : -1,
    #'scale_pos_weight' : ???
    'early_stopping_round' : 10,
    #'min_data_in_leaf' : ???,
    #'pos_bagging_fraction' : ???,
    #'neg_bagging_fraction' : ???,
    #'bagging_freq' : ???,
    #'max_delta_step' : ???,
    #'top_rate' : ???
    #'other_rate' : ???
    #'lambda_l1' : ???
    #'lambda_l2' : ???
}
In [69]:
# x_trainr was mutated through the alias above, so the added price_pred and
# PRICE columns must be dropped before building the LightGBM training set
lgb_house_train = lgb.Dataset(x_trainr.drop(columns = ['price_pred', 'PRICE']), y_trainr)
lgb_house_val = lgb.Dataset(x_valr, y_valr)
In [70]:
lgb_gbm_reg = lgb.train(params = lgb_params_r, train_set = lgb_house_train,
                num_boost_round = 100, valid_sets = [lgb_house_val, lgb_house_train],
               valid_names = ['Evaluation', 'Train'])
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:123: UserWarning:

Found `early_stopping_round` in params. Will use it instead of argument

[1]	Train's l1: 53324.3	Evaluation's l1: 50792.3
Training until validation scores don't improve for 10 rounds.
[2]	Train's l1: 49562.5	Evaluation's l1: 46989.2
[3]	Train's l1: 46142.9	Evaluation's l1: 43725
[4]	Train's l1: 43118.5	Evaluation's l1: 40989
[5]	Train's l1: 40285.7	Evaluation's l1: 37853.4
[6]	Train's l1: 37710.1	Evaluation's l1: 35472.9
[7]	Train's l1: 35541.6	Evaluation's l1: 33585.8
[8]	Train's l1: 33276.3	Evaluation's l1: 31734.4
[9]	Train's l1: 31096.8	Evaluation's l1: 29850.6
[10]	Train's l1: 29524.9	Evaluation's l1: 28388.3
[11]	Train's l1: 27965.5	Evaluation's l1: 27090.7
[12]	Train's l1: 26625.5	Evaluation's l1: 26045.1
[13]	Train's l1: 25512.2	Evaluation's l1: 25147.3
[14]	Train's l1: 24355.2	Evaluation's l1: 24233.1
[15]	Train's l1: 23483.6	Evaluation's l1: 23486.1
[16]	Train's l1: 22604.7	Evaluation's l1: 22786.7
[17]	Train's l1: 21815.7	Evaluation's l1: 22141.2
[18]	Train's l1: 21083.2	Evaluation's l1: 21565.9
[19]	Train's l1: 20464.1	Evaluation's l1: 21020.7
[20]	Train's l1: 19857.3	Evaluation's l1: 20567.1
[21]	Train's l1: 19294.6	Evaluation's l1: 20252.1
[22]	Train's l1: 18813.1	Evaluation's l1: 19926.1
[23]	Train's l1: 18414.8	Evaluation's l1: 19656.4
[24]	Train's l1: 17995.2	Evaluation's l1: 19327.6
[25]	Train's l1: 17610.8	Evaluation's l1: 18942.7
[26]	Train's l1: 17242.7	Evaluation's l1: 18665.8
[27]	Train's l1: 17042.7	Evaluation's l1: 18520.5
[28]	Train's l1: 16804.8	Evaluation's l1: 18187.2
[29]	Train's l1: 16581.3	Evaluation's l1: 17989
[30]	Train's l1: 16381.4	Evaluation's l1: 17884.7
[31]	Train's l1: 16193.5	Evaluation's l1: 17782.2
[32]	Train's l1: 16018.2	Evaluation's l1: 17604.4
[33]	Train's l1: 15882.4	Evaluation's l1: 17504
[34]	Train's l1: 15716.3	Evaluation's l1: 17406.5
[35]	Train's l1: 15632.5	Evaluation's l1: 17358.3
[36]	Train's l1: 15496.1	Evaluation's l1: 17352.8
[37]	Train's l1: 15327.2	Evaluation's l1: 17231.8
[38]	Train's l1: 15249.9	Evaluation's l1: 17197.4
[39]	Train's l1: 15169.7	Evaluation's l1: 17159.7
[40]	Train's l1: 15107.8	Evaluation's l1: 17095.3
[41]	Train's l1: 14973.1	Evaluation's l1: 17101.9
[42]	Train's l1: 14941.7	Evaluation's l1: 17131.2
[43]	Train's l1: 14857	Evaluation's l1: 17040.8
[44]	Train's l1: 14742	Evaluation's l1: 17036.3
[45]	Train's l1: 14626.1	Evaluation's l1: 17006.3
[46]	Train's l1: 14537.5	Evaluation's l1: 16941.7
[47]	Train's l1: 14506	Evaluation's l1: 16939.6
[48]	Train's l1: 14412.4	Evaluation's l1: 16954.5
[49]	Train's l1: 14346.4	Evaluation's l1: 16906.3
[50]	Train's l1: 14252.9	Evaluation's l1: 16840
[51]	Train's l1: 14232.5	Evaluation's l1: 16800.7
[52]	Train's l1: 14149	Evaluation's l1: 16722.8
[53]	Train's l1: 14080.1	Evaluation's l1: 16734.2
[54]	Train's l1: 14069.3	Evaluation's l1: 16662.9
[55]	Train's l1: 14045.9	Evaluation's l1: 16614.1
[56]	Train's l1: 13950	Evaluation's l1: 16593.4
[57]	Train's l1: 13887.6	Evaluation's l1: 16506.2
[58]	Train's l1: 13846	Evaluation's l1: 16529.3
[59]	Train's l1: 13794.9	Evaluation's l1: 16526.3
[60]	Train's l1: 13777.2	Evaluation's l1: 16472
[61]	Train's l1: 13715	Evaluation's l1: 16512.3
[62]	Train's l1: 13702.7	Evaluation's l1: 16489.1
[63]	Train's l1: 13647.2	Evaluation's l1: 16407.8
[64]	Train's l1: 13610.4	Evaluation's l1: 16407.5
[65]	Train's l1: 13568.9	Evaluation's l1: 16354.5
[66]	Train's l1: 13528.6	Evaluation's l1: 16280.2
[67]	Train's l1: 13478.2	Evaluation's l1: 16278.1
[68]	Train's l1: 13428.3	Evaluation's l1: 16244.8
[69]	Train's l1: 13351.9	Evaluation's l1: 16228.7
[70]	Train's l1: 13337.3	Evaluation's l1: 16195.7
[71]	Train's l1: 13325.3	Evaluation's l1: 16192.1
[72]	Train's l1: 13274.4	Evaluation's l1: 16213
[73]	Train's l1: 13232.8	Evaluation's l1: 16207.7
[74]	Train's l1: 13225.7	Evaluation's l1: 16166.7
[75]	Train's l1: 13214.8	Evaluation's l1: 16169.3
[76]	Train's l1: 13191.8	Evaluation's l1: 16204.1
[77]	Train's l1: 13174.5	Evaluation's l1: 16197.7
[78]	Train's l1: 13163.1	Evaluation's l1: 16195.1
[79]	Train's l1: 13098.4	Evaluation's l1: 16168.1
[80]	Train's l1: 13088.4	Evaluation's l1: 16138.5
[81]	Train's l1: 13050.2	Evaluation's l1: 16133.1
[82]	Train's l1: 13017.1	Evaluation's l1: 16114.9
[83]	Train's l1: 12989.1	Evaluation's l1: 16134.7
[84]	Train's l1: 12968.7	Evaluation's l1: 16144.6
[85]	Train's l1: 12956.5	Evaluation's l1: 16132.2
[86]	Train's l1: 12922.7	Evaluation's l1: 16087.4
[87]	Train's l1: 12851	Evaluation's l1: 16080.4
[88]	Train's l1: 12826.7	Evaluation's l1: 16058.4
[89]	Train's l1: 12813	Evaluation's l1: 16057.8
[90]	Train's l1: 12739.6	Evaluation's l1: 16070.2
[91]	Train's l1: 12719.1	Evaluation's l1: 16079.3
[92]	Train's l1: 12679.5	Evaluation's l1: 16059.3
[93]	Train's l1: 12647.4	Evaluation's l1: 16072.8
[94]	Train's l1: 12627.7	Evaluation's l1: 16077.5
[95]	Train's l1: 12556.1	Evaluation's l1: 16032.6
[96]	Train's l1: 12475.6	Evaluation's l1: 16006.6
[97]	Train's l1: 12465.8	Evaluation's l1: 15996
[98]	Train's l1: 12449.6	Evaluation's l1: 16002.4
[99]	Train's l1: 12400.7	Evaluation's l1: 15964.6
[100]	Train's l1: 12346.4	Evaluation's l1: 15952
Did not meet early stopping. Best iteration is:
[100]	Train's l1: 12346.4	Evaluation's l1: 15952
In [71]:
house_price_train_preds['price_pred_lgb'] = lgb_gbm_reg.predict(x_trainr)

house_price_test_preds['price_pred_lgb'] = lgb_gbm_reg.predict(x_testr)

house_price_test_preds['ERROR_lgb'] = house_price_test_preds['PRICE'] - house_price_test_preds['price_pred_lgb']

print("LGB Train R2 =", r2_score(house_price_train_preds['PRICE'], house_price_train_preds['price_pred_lgb'])
    ,"LGB Test R2 =", r2_score(house_price_test_preds['PRICE'], house_price_test_preds['price_pred_lgb']))
LGB Train R2 = 0.9406910905140995 LGB Test R2 = 0.8786566847373365
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [72]:
import plotly.express as px
fig = px.scatter(house_price_test_preds, x="price_pred_lgb", y="PRICE",
                 hover_data=['ERROR_lgb', 'PRICE'])
fig.show()

Part III: Multinomial Classification

XGBoost for multinomial classification

In [160]:
# We will reuse our wine dataset for multinomial classification:
wine_df = pd.read_csv('/Users/matthewberezo/Documents/wineQualityReds.csv')
In [161]:
wine_df = wine_df.drop(['Unnamed: 0'], axis=1)
wine_df.head()
Out[161]:
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [162]:
wine_df.shape
Out[162]:
(1599, 12)
In [163]:
wine_df['quality'].unique()
Out[163]:
array([5, 6, 7, 4, 8, 3])
In [164]:
wine_df['quality'] = wine_df['quality'] - 3
wine_df['quality'].unique()
Out[164]:
array([2, 3, 4, 1, 5, 0])
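Subtracting 3 is necessary because XGBoost's multiclass objectives expect integer labels in the range 0 to num_class − 1. A self-contained check of the remapping:

```python
# Self-contained check of the remapping (same values as the unique() output):
import numpy as np

quality = np.array([5, 6, 7, 4, 8, 3])  # raw wine quality scores
labels = quality - 3
print(sorted(set(labels.tolist())))  # [0, 1, 2, 3, 4, 5]
```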
In [165]:
x_train, x_test, y_train, y_test = train_test_split(wine_df, 
                                                    wine_df['quality'], 
                                                    test_size=0.2, 
                                                    random_state=1)

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size = 0.2, random_state = 1)
In [166]:
xgb_param_mn = {'max_depth':3,
         'eta': 0.35,
         'silent':1, # deprecated in newer xgboost; use 'verbosity' instead
         'objective':'multi:softprob', # also multi:softmax; both require "num_class"
         'eval_metric': 'mlogloss',
         'num_class' : 6
         #,'gamma': ???
         #,'lambda': ???
         #,'alpha': ???
         #,'min_child_weight': ???
         #,'colsample_bytree' : ???
         #,'colsample_bynode' : ???
         ,'maximize' : False
         ,'n_jobs' : -1
         #,'base_score' : ???
         #,'max_delta_step' : ???
        }
In [167]:
dtrain_mn = xgb.DMatrix(data = x_train.drop(columns = 'quality'), label = y_train)
dval_mn = xgb.DMatrix(data = x_val.drop(columns = 'quality'), label = y_val)
dtest_mn = xgb.DMatrix(data = x_test.drop(columns = 'quality'), label = y_test)
In [168]:
# specify the validation set to watch performance
watchlist = [(dtrain_mn, 'train'), (dval_mn, 'eval')]
num_round = 25 # the number of boosting rounds is itself a hyperparameter
bst = xgb.train(xgb_param_mn, dtrain_mn, num_round, watchlist, early_stopping_rounds = 10)
[0]	train-mlogloss:1.44009	eval-mlogloss:1.4771
Multiple eval metrics have been passed: 'eval-mlogloss' will be used for early stopping.

Will train until eval-mlogloss hasn't improved in 10 rounds.
[1]	train-mlogloss:1.24602	eval-mlogloss:1.30424
[2]	train-mlogloss:1.12197	eval-mlogloss:1.19811
[3]	train-mlogloss:1.03202	eval-mlogloss:1.12429
[4]	train-mlogloss:0.963136	eval-mlogloss:1.07031
[5]	train-mlogloss:0.911838	eval-mlogloss:1.03106
[6]	train-mlogloss:0.868054	eval-mlogloss:1.0012
[7]	train-mlogloss:0.831001	eval-mlogloss:0.976162
[8]	train-mlogloss:0.800427	eval-mlogloss:0.958199
[9]	train-mlogloss:0.773763	eval-mlogloss:0.948184
[10]	train-mlogloss:0.750337	eval-mlogloss:0.930876
[11]	train-mlogloss:0.72819	eval-mlogloss:0.921926
[12]	train-mlogloss:0.71177	eval-mlogloss:0.916365
[13]	train-mlogloss:0.692006	eval-mlogloss:0.910865
[14]	train-mlogloss:0.677181	eval-mlogloss:0.907199
[15]	train-mlogloss:0.660626	eval-mlogloss:0.90019
[16]	train-mlogloss:0.648831	eval-mlogloss:0.897819
[17]	train-mlogloss:0.633846	eval-mlogloss:0.896952
[18]	train-mlogloss:0.621332	eval-mlogloss:0.891391
[19]	train-mlogloss:0.610788	eval-mlogloss:0.890089
[20]	train-mlogloss:0.597789	eval-mlogloss:0.884995
[21]	train-mlogloss:0.585742	eval-mlogloss:0.882559
[22]	train-mlogloss:0.575343	eval-mlogloss:0.878576
[23]	train-mlogloss:0.563857	eval-mlogloss:0.872269
[24]	train-mlogloss:0.554839	eval-mlogloss:0.872766
In [169]:
preds = bst.predict(dtrain_mn)
preds_test = bst.predict(dtest_mn)
In [170]:
best_preds = np.asarray([np.argmax(line) for line in preds])
best_preds_test = np.asarray([np.argmax(line) for line in preds_test])
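Since `multi:softprob` returns one probability per class for each row, `np.argmax` recovers the most likely class; the list comprehension above is equivalent to a vectorized argmax along axis 1:

```python
# Toy probability matrix: each row is one sample's class probabilities;
# argmax along axis 1 recovers the predicted class index.
import numpy as np

probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2]])
best = probs.argmax(axis=1)
loop_best = np.asarray([np.argmax(row) for row in probs])  # same result
print(best.tolist())  # [1, 0]
```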
In [171]:
best_preds_df_train = pd.DataFrame(best_preds).add_prefix('PRED_QUAL')

best_preds_df_test = pd.DataFrame(best_preds_test).add_prefix('PRED_QUAL')
best_preds_df_test.head()
Out[171]:
PRED_QUAL0
0 3
1 3
2 3
3 3
4 3
In [172]:
y_train_df = pd.DataFrame(y_train).add_prefix('QUALITY')
y_train_df = y_train_df.reset_index()
y_train_df['PRED_QUALITY'] = best_preds_df_train['PRED_QUAL0']
y_train_df['CORRECT_PREDS'] = np.where(y_train_df['PRED_QUALITY'] == y_train_df['QUALITYquality'], 1, 0)

y_test_df = pd.DataFrame(y_test).add_prefix('QUALITY')
y_test_df = y_test_df.reset_index()
y_test_df['PRED_QUALITY'] = best_preds_df_test['PRED_QUAL0']
y_test_df['CORRECT_PREDS'] = np.where(y_test_df['PRED_QUALITY'] == y_test_df['QUALITYquality'], 1, 0)
y_test_df.head()
Out[172]:
index QUALITYquality PRED_QUALITY CORRECT_PREDS
0 75 2 3 0
1 1283 3 3 1
2 408 3 3 1
3 1281 3 3 1
4 1118 3 3 1
In [173]:
sum(y_train_df['CORRECT_PREDS'])/len(y_train_df)
Out[173]:
0.8172043010752689
In [174]:
sum(y_test_df['CORRECT_PREDS'])/len(y_test_df)
Out[174]:
0.640625
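The train/test accuracies above are computed by summing a 0/1 indicator column; the same number also falls out of the mean of a boolean equality mask, and `pd.crosstab` gives a confusion matrix for free. A minimal sketch on toy labels (the column names mirror `y_test_df`; the values here are made up):

```python
import pandas as pd

# Toy actual vs. predicted quality labels standing in for y_test_df
df = pd.DataFrame({
    'QUALITYquality': [2, 3, 3, 3, 3, 4],
    'PRED_QUALITY':   [3, 3, 3, 3, 3, 3],
})

# Accuracy: mean of the boolean equality mask
accuracy = (df['PRED_QUALITY'] == df['QUALITYquality']).mean()
print(accuracy)  # 4 of 6 correct → ~0.667

# Confusion matrix: rows = actual class, columns = predicted class
print(pd.crosstab(df['QUALITYquality'], df['PRED_QUALITY']))
```

The crosstab makes it obvious when a model is collapsing onto one class, which raw accuracy can hide.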

Multinomial Classification with LightGBM

In [175]:
lgb_params_mn = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',  # 'multiclassova' (one-vs-all) is also available
    'metric': 'multi_logloss',
    'num_class': 6,
    'max_depth': 3,
    #'num_leaves': ???,
    'learning_rate': 0.1,
    #'num_threads': -1,
    #'scale_pos_weight': ???,
    'early_stopping_round': 10,
    #'min_data_in_leaf': ???,
    #'pos_bagging_fraction': ???,
    #'neg_bagging_fraction': ???,
    #'bagging_freq': ???,
    #'max_delta_step': ???,
    #'top_rate': ???,
    #'other_rate': ???,
    #'lambda_l1': ???,
    #'lambda_l2': ???,
}
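Note that leaving `num_leaves` unset means LightGBM's default of 31 applies, but with `max_depth = 3` a binary tree can never grow more than 2^3 = 8 leaves, so the depth cap is the binding constraint here. A quick sanity check of that arithmetic (the 2^max_depth bound is the standard leaf limit for a depth-limited binary tree):

```python
max_depth = 3
default_num_leaves = 31  # LightGBM's default when num_leaves is unset

# A depth-limited binary tree has at most 2**max_depth leaves,
# so the effective leaf count is the smaller of the two settings
effective_leaves = min(default_num_leaves, 2 ** max_depth)
print(effective_leaves)  # → 8
```

If the depth cap were relaxed, `num_leaves` would become the parameter actually controlling tree complexity.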
In [176]:
lgb_wine_train = lgb.Dataset(x_train.drop(columns = 'quality'), y_train)
lgb_wine_val = lgb.Dataset(x_val.drop(columns = 'quality'), y_val)
In [177]:
lgb_gbm_nm = lgb.train(params = lgb_params_mn, train_set = lgb_wine_train,
                num_boost_round = 100, valid_sets = [lgb_wine_val, lgb_wine_train],
               valid_names = ['Evaluation', 'Train'])
[1]	Train's multi_logloss: 1.16149	Evaluation's multi_logloss: 1.12062
Training until validation scores don't improve for 10 rounds.
[2]	Train's multi_logloss: 1.12305	Evaluation's multi_logloss: 1.0959
[3]	Train's multi_logloss: 1.08929	Evaluation's multi_logloss: 1.08047
[4]	Train's multi_logloss: 1.0589	Evaluation's multi_logloss: 1.06266
[5]	Train's multi_logloss: 1.03425	Evaluation's multi_logloss: 1.04581
[6]	Train's multi_logloss: 1.0121	Evaluation's multi_logloss: 1.03106
[7]	Train's multi_logloss: 0.991015	Evaluation's multi_logloss: 1.01942
[8]	Train's multi_logloss: 0.972148	Evaluation's multi_logloss: 1.00779
[9]	Train's multi_logloss: 0.955932	Evaluation's multi_logloss: 0.998886
[10]	Train's multi_logloss: 0.940209	Evaluation's multi_logloss: 0.989636
[11]	Train's multi_logloss: 0.925009	Evaluation's multi_logloss: 0.980539
[12]	Train's multi_logloss: 0.911391	Evaluation's multi_logloss: 0.974622
[13]	Train's multi_logloss: 0.89886	Evaluation's multi_logloss: 0.967791
[14]	Train's multi_logloss: 0.887049	Evaluation's multi_logloss: 0.96315
[15]	Train's multi_logloss: 0.876694	Evaluation's multi_logloss: 0.957396
[16]	Train's multi_logloss: 0.866565	Evaluation's multi_logloss: 0.953105
[17]	Train's multi_logloss: 0.856415	Evaluation's multi_logloss: 0.94751
[18]	Train's multi_logloss: 0.846986	Evaluation's multi_logloss: 0.942516
[19]	Train's multi_logloss: 0.838905	Evaluation's multi_logloss: 0.937895
[20]	Train's multi_logloss: 0.829045	Evaluation's multi_logloss: 0.933193
[21]	Train's multi_logloss: 0.821095	Evaluation's multi_logloss: 0.931277
[22]	Train's multi_logloss: 0.813772	Evaluation's multi_logloss: 0.927061
[23]	Train's multi_logloss: 0.805308	Evaluation's multi_logloss: 0.923197
[24]	Train's multi_logloss: 0.799234	Evaluation's multi_logloss: 0.920035
[25]	Train's multi_logloss: 0.79242	Evaluation's multi_logloss: 0.917719
[26]	Train's multi_logloss: 0.784722	Evaluation's multi_logloss: 0.914152
[27]	Train's multi_logloss: 0.779113	Evaluation's multi_logloss: 0.910955
[28]	Train's multi_logloss: 0.773132	Evaluation's multi_logloss: 0.908096
[29]	Train's multi_logloss: 0.767466	Evaluation's multi_logloss: 0.90709
[30]	Train's multi_logloss: 0.761835	Evaluation's multi_logloss: 0.905268
[31]	Train's multi_logloss: 0.755885	Evaluation's multi_logloss: 0.902036
[32]	Train's multi_logloss: 0.750416	Evaluation's multi_logloss: 0.900541
[33]	Train's multi_logloss: 0.745015	Evaluation's multi_logloss: 0.898562
[34]	Train's multi_logloss: 0.740407	Evaluation's multi_logloss: 0.897805
[35]	Train's multi_logloss: 0.734986	Evaluation's multi_logloss: 0.894975
[36]	Train's multi_logloss: 0.730589	Evaluation's multi_logloss: 0.89392
[37]	Train's multi_logloss: 0.726285	Evaluation's multi_logloss: 0.891377
[38]	Train's multi_logloss: 0.721481	Evaluation's multi_logloss: 0.889431
[39]	Train's multi_logloss: 0.717953	Evaluation's multi_logloss: 0.888948
[40]	Train's multi_logloss: 0.714254	Evaluation's multi_logloss: 0.886986
[41]	Train's multi_logloss: 0.710213	Evaluation's multi_logloss: 0.886599
[42]	Train's multi_logloss: 0.706002	Evaluation's multi_logloss: 0.885943
[43]	Train's multi_logloss: 0.702589	Evaluation's multi_logloss: 0.886441
[44]	Train's multi_logloss: 0.698469	Evaluation's multi_logloss: 0.886278
[45]	Train's multi_logloss: 0.694338	Evaluation's multi_logloss: 0.885399
[46]	Train's multi_logloss: 0.691387	Evaluation's multi_logloss: 0.884898
[47]	Train's multi_logloss: 0.688239	Evaluation's multi_logloss: 0.884298
[48]	Train's multi_logloss: 0.684297	Evaluation's multi_logloss: 0.883591
[49]	Train's multi_logloss: 0.681838	Evaluation's multi_logloss: 0.883739
[50]	Train's multi_logloss: 0.67842	Evaluation's multi_logloss: 0.88199
[51]	Train's multi_logloss: 0.675594	Evaluation's multi_logloss: 0.88052
[52]	Train's multi_logloss: 0.672433	Evaluation's multi_logloss: 0.88035
[53]	Train's multi_logloss: 0.66866	Evaluation's multi_logloss: 0.879536
[54]	Train's multi_logloss: 0.665775	Evaluation's multi_logloss: 0.878679
[55]	Train's multi_logloss: 0.662837	Evaluation's multi_logloss: 0.877464
[56]	Train's multi_logloss: 0.66037	Evaluation's multi_logloss: 0.876783
[57]	Train's multi_logloss: 0.657624	Evaluation's multi_logloss: 0.875825
[58]	Train's multi_logloss: 0.654868	Evaluation's multi_logloss: 0.87612
[59]	Train's multi_logloss: 0.651978	Evaluation's multi_logloss: 0.876119
[60]	Train's multi_logloss: 0.647898	Evaluation's multi_logloss: 0.874939
[61]	Train's multi_logloss: 0.645247	Evaluation's multi_logloss: 0.874794
[62]	Train's multi_logloss: 0.642509	Evaluation's multi_logloss: 0.874012
[63]	Train's multi_logloss: 0.639476	Evaluation's multi_logloss: 0.872552
[64]	Train's multi_logloss: 0.63679	Evaluation's multi_logloss: 0.872454
[65]	Train's multi_logloss: 0.63441	Evaluation's multi_logloss: 0.872135
[66]	Train's multi_logloss: 0.631869	Evaluation's multi_logloss: 0.871605
[67]	Train's multi_logloss: 0.628169	Evaluation's multi_logloss: 0.871238
[68]	Train's multi_logloss: 0.625595	Evaluation's multi_logloss: 0.871239
[69]	Train's multi_logloss: 0.62335	Evaluation's multi_logloss: 0.870751
[70]	Train's multi_logloss: 0.620533	Evaluation's multi_logloss: 0.869603
[71]	Train's multi_logloss: 0.617412	Evaluation's multi_logloss: 0.868818
[72]	Train's multi_logloss: 0.614697	Evaluation's multi_logloss: 0.868147
[73]	Train's multi_logloss: 0.612396	Evaluation's multi_logloss: 0.867526
[74]	Train's multi_logloss: 0.609294	Evaluation's multi_logloss: 0.86772
[75]	Train's multi_logloss: 0.606728	Evaluation's multi_logloss: 0.867203
[76]	Train's multi_logloss: 0.604354	Evaluation's multi_logloss: 0.867475
[77]	Train's multi_logloss: 0.600776	Evaluation's multi_logloss: 0.867763
[78]	Train's multi_logloss: 0.598852	Evaluation's multi_logloss: 0.868368
[79]	Train's multi_logloss: 0.596732	Evaluation's multi_logloss: 0.867838
[80]	Train's multi_logloss: 0.594745	Evaluation's multi_logloss: 0.868011
[81]	Train's multi_logloss: 0.591846	Evaluation's multi_logloss: 0.867508
[82]	Train's multi_logloss: 0.589231	Evaluation's multi_logloss: 0.866276
[83]	Train's multi_logloss: 0.587547	Evaluation's multi_logloss: 0.866869
[84]	Train's multi_logloss: 0.585358	Evaluation's multi_logloss: 0.868107
[85]	Train's multi_logloss: 0.582204	Evaluation's multi_logloss: 0.866635
[86]	Train's multi_logloss: 0.580582	Evaluation's multi_logloss: 0.866382
[87]	Train's multi_logloss: 0.578067	Evaluation's multi_logloss: 0.865642
[88]	Train's multi_logloss: 0.575448	Evaluation's multi_logloss: 0.865564
[89]	Train's multi_logloss: 0.574018	Evaluation's multi_logloss: 0.865289
[90]	Train's multi_logloss: 0.572632	Evaluation's multi_logloss: 0.865172
[91]	Train's multi_logloss: 0.569797	Evaluation's multi_logloss: 0.864413
[92]	Train's multi_logloss: 0.566387	Evaluation's multi_logloss: 0.863706
[93]	Train's multi_logloss: 0.564208	Evaluation's multi_logloss: 0.863492
[94]	Train's multi_logloss: 0.562624	Evaluation's multi_logloss: 0.863204
[95]	Train's multi_logloss: 0.56039	Evaluation's multi_logloss: 0.862084
[96]	Train's multi_logloss: 0.558042	Evaluation's multi_logloss: 0.861308
[97]	Train's multi_logloss: 0.555607	Evaluation's multi_logloss: 0.861264
[98]	Train's multi_logloss: 0.553848	Evaluation's multi_logloss: 0.860439
[99]	Train's multi_logloss: 0.551579	Evaluation's multi_logloss: 0.859375
[100]	Train's multi_logloss: 0.548919	Evaluation's multi_logloss: 0.858466
Did not meet early stopping. Best iteration is:
[100]	Train's multi_logloss: 0.548919	Evaluation's multi_logloss: 0.858466
In [178]:
lgb_mn_preds_train = lgb_gbm_nm.predict(x_train.drop(columns = 'quality'))
lgb_mn_preds_test = lgb_gbm_nm.predict(x_test.drop(columns = 'quality'))
In [185]:
lgb_gbm_nm.print_evaluation
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-185-33c0431bb88b> in <module>
----> 1 lgb_gbm_nm.print_evaluation

AttributeError: 'Booster' object has no attribute 'print_evaluation'

This fails because `print_evaluation` is a module-level callback (`lgb.print_evaluation`, renamed `lgb.log_evaluation` in newer LightGBM releases) meant to be passed to `lgb.train(..., callbacks=[...])`, not a method on the trained `Booster`.
In [179]:
best_lgb_preds_train = np.asarray([np.argmax(line) for line in lgb_mn_preds_train])
best_lgb_preds_test = np.asarray([np.argmax(line) for line in lgb_mn_preds_test])
In [180]:
best_lgb_preds_train = pd.DataFrame(best_lgb_preds_train).add_prefix('PRED_QUAL')

best_lgb_preds_test = pd.DataFrame(best_lgb_preds_test).add_prefix('PRED_QUAL')
best_lgb_preds_test.head()
Out[180]:
PRED_QUAL0
0 3
1 3
2 3
3 3
4 3
In [181]:
y_train_df['PRED_QUALITY_LGB'] = best_lgb_preds_train['PRED_QUAL0']
y_train_df['CORRECT_PREDS_LGB'] = np.where(y_train_df['PRED_QUALITY_LGB'] == y_train_df['QUALITYquality'], 1, 0)

y_test_df['PRED_QUALITY_LGB'] = best_lgb_preds_test['PRED_QUAL0']
y_test_df['CORRECT_PREDS_LGB'] = np.where(y_test_df['PRED_QUALITY_LGB'] == y_test_df['QUALITYquality'], 1, 0)
y_test_df.head()
Out[181]:
index QUALITYquality PRED_QUALITY CORRECT_PREDS PRED_QUALITY_LGB CORRECT_PREDS_LGB
0 75 2 3 0 3 0
1 1283 3 3 1 3 1
2 408 3 3 1 3 1
3 1281 3 3 1 3 1
4 1118 3 3 1 3 1
In [182]:
sum(y_train_df['CORRECT_PREDS_LGB'])/len(y_train_df)
Out[182]:
0.804496578690127
In [183]:
sum(y_test_df['CORRECT_PREDS_LGB'])/len(y_test_df)
Out[183]:
0.640625
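Both models land at the same 0.640625 test accuracy, and the first few predictions are all class 3, so it is worth checking how much of that accuracy a majority-class baseline would already capture. A minimal sketch (the label vector here is synthetic; in the notebook one would pass `y_test_df['QUALITYquality']` instead):

```python
import pandas as pd

# Synthetic stand-in for the true test labels
y_true = pd.Series([3, 3, 2, 3, 4, 3, 3, 2, 3, 3])

# Accuracy of always predicting the most common class
majority_class = y_true.mode()[0]
baseline_accuracy = (y_true == majority_class).mean()
print(majority_class, baseline_accuracy)  # class 3 appears 7 of 10 times → 0.7
```

If the boosted models only narrowly beat this baseline, most of their apparent accuracy comes from class imbalance rather than learned structure.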